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Parallel  Language  Constructs  for  Tensor  Product 
Computations  on  Loosely  Coupled  Architectures* 

Piyush  Mehrotra^ 

John  Van  Rosendale 

Institute  for  Computer  Applications  in  Science  and  Engineering 

Hampton,  VA  23665. 


Abstract 

Distributed  memory  architectures  offer  hign  levels  of  performance  and  flexibility, 
but  have  proven  awkward  to  program.  Current  languages  for  nonshared  memory  archi¬ 
tectures  provide  a  relatively  low-level  programming  environment,  and  are  poorly  suited 
to  modular  programming,  and  to  the  construction  of  libraries.  This  paper  describes 
a  set  of  language  primitives  designed  to  allow  the  specification  of  parallel  numerical 
algorithms  at  a  higher  level.  We  focus  here  on  tensor  product  array  computations,  a 
simple  but  important  class  of  numerical  algorithms.  We  consider  first  the  problem  of 
programming  one  dimensionah"kernel” ^routines,  such  as  parallel  tridiagonal  solvers, 
and  after  that  look  at  how  such  parallel  kernels  can  be  combined  to  form  parallel  tensor 
product  algorithms.  'yr-+f>  ,f  i  , 
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1  Introduction 


Distributed  memory  architectures  offer  very  high  levels  of  performance  at  modest  cost. 
Machines  now  becoming  available  have  peak  speeds  of  many  gigaflops,  at  a  fraction  of  the 
cost  of  high-end  vector  multiprocessors,  and  teraflop  machines  will  be  possible  in  a  few  years. 
Given  this  performance  potential,  it  is  clear  that  such  architectures  will  play  an  increasing 

role  in  scientific  computing. 

However,  there  remains  a  fundamental  problem  with  distributed  memory  architectures; 
they  tent:  to  be  quite  awkward  to  program.  Parallel  programming  is  generally  harder  than 
seuuentiai  programming,  since  the  programmer  needs  to  be  aware  of  the  multiple  threads  of 
control  :iov,  and  the  subtle  semantic  issues  which  can  arise[6j.  Nonshared  memory  architec¬ 
tures  compound  this  difficulty  by  forcing  the  programmer  to  decompose  each  data  structure 
into  separate  pieces,  each  "owned'1  by  one  of  the  processors.  Also,  on  such  architectures, 
all  interprocessor  communication  must  be  explicitly  specified  using  the  low-level  message 
passing  constructs  supported  by  the  architecture. 

Some  of  the  difficulty  of  programming  distributed  memory  machines  seems  to  be  inherent 
in  this  class  of  architectures,  though  a  large  fraction  appears  to  be  due  to  a  lack  of  adequate 
programming  tools.  The  Kali  project  combines  work  of  the  authors  with  other  research  on 
programming  and  load  balancing  being  carried  out  at  ICASE  [2,  12,  18].  The  ultimate  goal 
is  to  ameliorate  the  problem  of  programming  distributed  memory  architectures  to  the  extent 
possible. 


Nature  of  the  problem 

This  paper  focuses  on  the  problem  of  expressing  tensor  product  array  computations,  for 
distributed  memory  architectures.  Tensor  product  algorithms  are  those  in  which  multidi¬ 
mensional  arrays  are  manipulated  by  applying  operations  to  lower  dimensional  slices.  Such 
algorithms  offer  both  simplicity  and  efficiency.  Thus  tensor  product  algorithms  are  widely 
used  in  spline  fitting,  in  picture  processing,  in  computer  aided  geometry,  in  computational 
fluid  dynamics,  and  so  forth. 

Despite  this  diver?  ity  of  applications,  the  programming  issues  in  tensor  product  algo¬ 
rithms  are  universal.  One  needs  effective  ways  of  pealing  off  lower  dimensional  slices  of 
arrays,  and  one  needs  to  be  able  to  apply  one  of  a  number  of  different  operations  on  those 
slices.  In  some  cases,  the  algorithm  applied  to  the  slices  will  also  be  a  tensor  product 
algorithm,  (c.f.  section  5.).  When  these  tensor  product  algorithms  are  implemented  on  dis¬ 
tributed  memory  machines,  both  the  original  array,  and  the  operations  on  the  slices  may  be 
distribute.  This  is  the  issue  addressed  in  this  paper. 

E hiding  th**  r-rnm  way  of  expressing  tensor  product  algorithms  on  distributed  mem¬ 
ory  architectures  is  a  basic  problem.  Tensor  product  algorithms,  and  array  manipulation 
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algorithms  more  generally,  are  a  natural  target  for  parallel  computing  research.  Virtually 
all  large  numerical  programs  involve  array  manipulation  in  some  way.  Moreover,  the  reg- 
ularitv  of  the  operations  involved,  and  the  potential  for  highly  efficient  execution,  makes 
these  algorithms  especially  susceptible  to  development  of  effective  language  constructs  and 
programme ng  techniques. 

The  plan  of  this  paper  is  as  follows.  In  section  2,  we  discuss  distributed  memory  nro- 
grnmming,  and  current  work  in  the  the  Kali  project.  Section  3  considers  the  programming  of 
one  dimensional  kernel  routines,  such  as  tridiagona!  solvers.  After  that  sections  4  and  5  look 
at  how  these  one  dimensional  kernels  can  be  combined  into  higher  dimensional  algorithms 
for  realistic  numerical  computations.  Finally,  section  6  briefly  discusses  the  adequacies  and 
limitations  of  the  present  approach,  and  directions  in  which  research  appears  necessary. 

2  Parallel  Programming  Constructs 

Most  programming  environments  for  distributed  memory  architectures  are  based  on  “mes¬ 
sage  passing  languages.”  The  basic  paradigm  for  such  languages  is  CSP  (Communicating 
Sequential  Processes)!?’,  in  which  independently  executing  sequential  tasks  interact  and  syn¬ 
chronize  through  messages.  Examples  include  Occam(20j,  and  message  passing  dialects  of  C 
and  Fortran  as  supplied  by  machine  manufacturers. 

In  one  sense,  these  message  passing  languages  are  ideal,  since  they  accurately  embody 
the  set  of  operations  which  can  be  performed  by  the  hardware.  However,  there  are  also 
serious  limitations  with  such  languages;  the  most  obvious  is  the  relatively  low-level  at  which 
algorithms  must  be  specified.  The  programmer  first  needs  to  distribute  the  data  structures 
across  a  set  of  processes,  each  having  its  own  separate  name-space.  Then,  in  order  to 
share  values  between  processes,  the  programmer  must  specify  a  “send”  operation  in  one 
process,  and  a  matching  “receive”  operation  in  another.  This  kind  of  programming  is  more 
challenging  than  one  would  expect,  induces  a  mass  of  low-level  “message  passing”  detail  for 
even  the  simplest  algorithms,  and  can  easily  lead  to  deadlock  or  non-determinancy  if  one  is 
not  sufficiently  careful[6]. 

Another  problem  exists  with  these  languages  as  well.  Modern  program  design  relies 
heavily  on  “modular  programming”  and  “top  down”  design.  One  builds  up  large  programs 
as  a  collection  of  “modules”  or  “procedures”  each  designed  to  perform  a  specific  subtask. 
Message  passing  languages  provide  no  support  for  this  kind  of  decomposition.  One  can  define 
“procedures”  running  on  each  processor,  but  there  is  no  notion  of  a  “distributed  procedure" 
running  op.  a  collection  of  processors. 
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The  KALI  Project 


i' he  Kali  project  is  an  attempt  to  define  tools  (languages,  compilers,  performance  pre- 
ciictors,  etc.)  to  allow  the  specification  of  programs  for  distributed  memory  architectures 
at  a  higher  level,  without  compromising  run-time  efficiency.  This  project  is  one  of  sev¬ 
eral  projects  addressing  the  basic  issue  of  high-level  programming  of  distributed  memory 
architectures:!,  21’.  To  achieve  the  efficiency  and  generality  of  message  passing  languages, 
while  still  permitting  high  level  specification  of  algorithms  is  a  difficult  and  multi-faceted 
problem.  In  :T7’  we  looked  at  the  problem  of  programming  iterative  algorithms  utilizing 
irregular  triangular  grids,  on  distributed  memory  architectures.  In  this  paper,  we  look  at 
the  equally  important  problem  of  programming  very  regular  “tensor  product”  calculations 

such  architectures 

In  this  paper,  we  describe  the  language  constructs  we  are  pioposing  in  a  Fortran-like 
d ia’ect.  KF1  (Kali  Fortran  1).  Jur  previous  papers  have  generally  used  a  more  Pascal  like 
syntax  16,  17'.  However,  syntax  is  not  the  issue.  Since  most  numerical  programmers  are 
more  comfortable  with  a  Fortran-like  syntax,  we  are  adopting  this  syntax  in  our  current 

To  illustrate  the  basic  concepts  in  our  approach,  we  consider  the  simple  problem  of  spec¬ 
ifying  a  Jacobi  iteration  for  a  distributed  memory  architecture.  Listing  I  gives  a  sequential 
version  of  a  Jacobi  algorithm  for  Poisson’s  equation  on  an  n  by  n  grid.  This  is  a  simple  al¬ 
gorithm  which  can  be  easily  rewritten  in  a  message  passing  language  for  parallel  execution. 
A  high  level  view  of  this  kind  of  code  is  given  in  Listing  2. 

In  the  message  passing  version  of  the  Jacobi  code  as  shown  in  Listing  2,  the  algorithm 
is  assumed  to  be  distributed  over  a  p2  array  of  processors.  The  data  is  divided  into  m  x  m 
blocks,  where  m  —  n/p,  so  that  each  processor  contains  a  contiguous  subarray  of  the  full 
solution  array.  Note  that  the  actual  array  is  declared  to  be  (m+2)  x  (m  +  2)  so  that  boundary 
data  can  be  easily  maintained  and  accessed.  Such  a  distribution  of  the  array  data  structure 
keeps  most  of  the  array  references  in  the  computation  local  while  balancing  the  load  across 
the  processors.  The  values  on  the  boundaries  of  the  subarray  “owned”  by  each  processor 
need  to  be  communicated  to  each  of  its  four  neighbors  at  each  iteration;  hence  the  sequence 
of  guarded  sends  and  receives  shown.  The  “if”  guards  are  needed,  since  processors  “owning” 
solution  values  along  physical  boundaries  of  the  processor  array  do  not  send  or  receive  these 
values  from  their  neighboring  processors.  Also,  the  communication  here  has  been  assumed  to 
be  asynchronous;  if  the  underlying  architecture  requires  synchronous  communications  then 
tiie  sends  and  receives  have  to  be  ordered  carefully  to  avoid  deadlock. 

The  KF1  code  for  this  algorithm,  given  in  Listing  3,  looks  very  much  like  the  sequential 
Fortran  code,  though  it  will  compile  into  message-passing  code  analogous  to  that  in  List¬ 
ing  2.  In  addition  to  the  code  in  the  sequential  Fortran  version,  the  user  must  specify  three 
addibonal  kinds  of  information  in  KF1: 
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parameter  (np  =  ...) 

real  X(0:np,  0:np),  f(0:np,  0:np) 

real  tmpX(0:np,  0:np) 

n  =  np  —  1 
do  1000  it  =  1,  50 
c 

c  copy  solution  into  a  temporary  array 

c 

do  200  j  =  1,  n 
do  100  i  =  1,  n 

tmpXl’i,  j)  =  X(i,  j) 

100  continue 

200  continue 

c 

c  update  solution  array 

c 

do  400  j  =  1,  n 
do  300  i  =  1,  n 

X(i,  j)  =  0.25*(tmpX(i-rl,  j)  +  tmpX(i-l,  j) 

&  +  tmpX(i,  j+1)  4-  tmpX(i,  j— 1))  -  f(i,  j) 

300  continue 

400  continue 

1000  continue 


Listing  1:  Sequential  Jacobi  algorithm 


a)  the  processor  array  on  which  the  program  is  to  be  executed, 

b)  the  distribution  of  the  data  structures  across  these  processors,  and 

c)  the  parallel  loops  specifying  the  computation  on  the  distributed  data  structures. 

These  are  aspects  of  the  parallel  algorithm  that  are  critical  to  performance.  Automatic 
generation  of  such  information  from  sequential  code  is  beyond  current  compiler  technology, 
and  hence  it  must  be  supplied  by  the  programmer. 

Processor  Array 

The  first  thing  that  needs  to  be  specified  is  a  “processor  array.”  This  is  an  array  of 
physical  processors  across  which  the  data  structures  will  be  distributed,  and  on  which  the 
algorithm  will  execute.  In  the  jacobi  routine,  the  processor  array  is  passed  in  as  an  argument. 
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Code  for  process  Plip.jp)  ... 


parameter  (mp  =  ...) 

real  X(0:mp,  0:mp),  f(0:mp,  0:mp) 

real  tmpX(0:mp,  Oar.p) 

m  -  mp  —  1 
do  1000  it  =  1,  50 
c 

c  copy  interior  of  solution  array  into  a  temporary  array 

c 

do  200  j  =  1,  m 
do  100  i  =  1,  m 
tmpX(i,  j)  =  X(i,  j) 

100  continue 

200  continue 

c 

c  send  edge  values  to  North,  South,  East  and  West  neighbors 

c 

if(ip.gt.  1  )  «end  (  P(ip— 1,  jp),  X(l,  l:m)  ) 

if  (  ip  .le.  np  )  send  (  P(ip  +  1,  jp),  X(m,  l:m)  ) 

if  (  jP  gt-  1  )  send  (  P(ip,  jp—  1),  X(l:m,  1)  ) 

if  (  jp  .le.  np  )  send  (  P(ip,  jp+1),  X(l:m,  m)  ) 

c 

c  receive  edge  values  from  neighbors 

c 

if  (  jp  .le.  np  )  reev  (  P(ip,  jp+1),  tmpX(l:m,  mp)  ) 

if  (  jp  .gt.  1  )  reev  (  P(ip,  jp-1),  tmpX(l:m,  0)  ) 

if  (  ip  .le.  np  )  reev  (  P(ip+1,  jp),  tmpX(mp,  l:m)  ) 

if  (  ip  gt-  1  )  reev  (  P(ip—  1,  jp),  tmpX(0,  l:m)  ) 

c 

c  update  solution  array  X 

c 

do  400  j  —  1,  m 
do  300  i  =  1,  m 

X(i,  j)  =  0.25*(tmpX(i+l,  j)  +  tmpX(i-l,  j) 

&  +  tmpX(i,  j+1)  +  tmpX(i,  j-1))  -  f(i,  j) 

300  continue 

400  continue 

1000  continue 


Listing  2:  Message-passing  version  of  Jacobi  algorithm 


parsub  jacobi(X,  f,  nn:  j>rocs) 
processors  procs(p,  p) 

real  X(0:np,  0: up) ,  f(0  up,  0:np)  dist  (block,  block) 

n  —  r.p  —  i 

do  '000  it  ^  1,  50 

c 

c  copy  solution  into  a  tt  m penary  array 

c 

doall  100  (i,  j)  --  :1,  n:  *  [  1 ,  if  on  o\vner(x(i,  j)) 

X(i,  j)  =  0.25*(X(i-r-l,  j)  -  X(i  —  1,  j)  -r  X(i,  j~i)  -t-  X(i,  j-1))  -  f(i,  j) 
100  con.inue 

1000  continue 

ret  urn 
end 


Listing  3:  KF1  version  of  the  Jacobi  algorithm 


The  routine  declares  the  argument  procs  to  be  a  two  dimensional  processor  array,  having 
size  p  by  p.  The  size  of  the  processor  array  argument  is  “open”,  and  is  determined  by  the 
actual  size  of  the  processor  array  passed  at  the  point  of  call.  The  identifier  p  reflects  this 
size,  and  can  be  used  in  the  body  of  the  subroutine  as  a  constant.  The  keyword  parsub 
declares  jacobi  to  be  a  parallel  subroutine,  which  will  execute  in  parallel  on  distributed  data. 
A  processor  argument  can  be  passed  only  to  parallel  subroutines  and  each  parallel  subroutine 
must  either  declare  a  processor  array  or  must  be  passed  one  as  an  argument. 

Only  one  “real”  processor  declaration  is  allow'ed  in  the  whole  program,  and  it  generally 
occurs  in  the  main  program.  The  processor  array  (or  its  slices)  can  then  be  passed  around 
as  a  parameter  for  parallel  execution  of  subroutines.  The  initial  processor  declaration  is  the 
“real  estate  agent”  discussed  by  C.  Seitz.  The  semantics  and  behavior  of  the  real  estate 
agent  is  interesting,  but  is  tangential  to  the  issues  here. 


Data  Distribution  Primitives 

(liven  a  processor  array,  the  programmer  must  specify  the  distribution  of  data  structures 
across  that  processor  array.  In  the  current  version  of  KF1,  the  only  distributed  data  type 
supported  is  distributed  arrays.  Array  distributions  are  specified  by  a  distribution  clause 
in  their  declaration.  This  clause  specifies  a  sequence  of  distribution  patterns,  one  for  each 
dimension  of  the  array.  Scalar  variables  and  arrays  without  a  distribution  clause  are  simply 
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cated  with  one  copy  assimmu  to  each  of  tiie  processors  in  the  processor  array. 

cm.cusi.wi  «.■:  a  data  array  can  he  distributed  across  the  processor:;  in  one  of  several 
rum  or  can  r-e  left  urn:  A:  rinu’.ed.  In  this  subroutine,  the  data  array  X  is  distributed 
u  ^  block ,  block)  uistr. nut :..n.  1' rr is  means  that  each  dimension  is  “blocked,”  so  that 
process,  o  receivi  s  a  square  subarray  of  the  full  array  .V.  Another  kind  of  distribution  is 
flic  osnec :al’v  use:::!  :n  numerical  linear  algebra,  in  which  the  elements  are 

mu;-  i  a  r  u::u-r. on  mmiion  across  tile  processors.  The  number  of  dimensions  of  an 
r  t:.at  are  cm*,  nbuteu  ::.ust  ::.atc:>  t:.e  number  of  dimensions  of  the  underlying  processor 
A  stem  sis  ore  useu  to  imucate  dimensions  ul  data  arrays  which  are  not  distributed. 

mu:  o'ubtue;  data  structures  are  passed  as  arguments  to  parallel  subroutines,  f he 
:  yr  cessors  owning  the  port;,  ms  «>>  trie  data  structures  being  passed,  needs  to  be  passed 
..  mus.  ;  »r  example,  passing  a  slice  of  a  distributed  array  often  entails  passing  a 
i.u.g  slice  the  processor  array.  This  idea  should  become  clearer  through  the  examples 


Doall  hoops 


vane 


•:.<  on  (iistributec:  da ‘a  structures  are  specified  by  doall  loops.  The  doall  loon 
a.:  to  t;te  forall  b  ■  •  ■  :t>  m  BI.AZF.  16  .  and  to  parallel  loops  in  other  parallel  dialects 
1  he  example  b-dow  shows  a  lot  .p  which  performs  u  —  1  loop  invocations,  shifting 


•  :te  space 


doall  1 f •! i  i  -  I,  n-1  on  owner; Afijj 


Ami  -  Af'i-i  i 


b*f)  continue 


The  semantics  here  are  “copy-in  copy-out,”  in  the  sense  that  the  values  on  the  right 
hand  side  of  the  assignment  are  the  old  values  in  array  A,  before  being  modified  by  the  loop. 
Tims  the  array  .1  is  effectively,  “copied  into"  each  invocation  of  the  doall  loop,  and  then  the 
changes  are  "copied  out.”  Thus  no  temporary  array  is  required  in  the  jaco'oi  routine  given 


In  addition  to  the  range  specification  in  the  header  of  the  doall,  there  is  also  an  on 
clause.  This  clause  specifies  the  processor  on  which  each  loop  invocation  is  to  be  executed, 
in  the  above  program  fragment,  the  on  clause  causes  the  uh  loop  invocation  to  be  executed 
or.  the  processor  owning  the  jth  element  of  the  array  .4. 

The  on  clause  associated  with  a  doall  loop  allows  the  compiler  to  partition  the  loop 
invocations  among  the  processors  participating  in  the  loop.  This  process,  called  “strip- 


’ .  is  fairly  simple  given  al'  the  information  available  to  the  compiler  [12,  13].  Note 
'1  at  the  code  outside  the  doall  loops  is  replicated  in  each  of  the  processors. 

The  loop  headers  of  purely  nested  doall  loops  can  be  combined  into  a  single  header  as 
shown  in  Listing  3  for  the  two  loops  of  the  jacobi  routine.  Here,  a  product  of  the  ranges  is 
used  to  specify  that  for  each  value  of  the  outer  loop  variable  i,  in  the  range  [1,  n\  the  inner 
loop  variable  j  assumes  each  of  the  values  in  the  range  [1,  n\. 

Specifying;  Communication 

L’sing  K  F 1 .  a  programmer  can  specify  a  data  parallel  algorithm  at  a  high  level,  while  still 
retaining  control  over  those  details  crhical  to  performance.  The  additional  information 
re. pared  of  the  user  here  is  exactly  the  information  most  critical  to  performance.  Note  that 
•hue  b  .  dy  of  the  doall  loop  here  is  independent  of  the  distribution  of  the  array  .V  and  of  the 
nr  cos-,  u  array  P.  Tims  a  variety  of  distribution  patterns  can  be  tried  by  simple  modifications 
this  program.  This  makes  ’Tuning”  of  parallel  programs  much  easier  with  this  kind  of 
.a:.g  u-.gv  than  it  ;s  with  message  passing  languages. 

It  is  important  to  note  that  KF1  contains  no  explicit  communication  constructs.  The 

mummer  sped  lies  distribution  of  data  values  across  the  processors,  and  also  specifies  the 
i  rati"::  where  operations  are  to  be  performed.  From  this,  the  compiler  produces  the  low- 
level  details  of  the  message  passing  code  to  be  executed  on  the  architecture  by  a  sequence 
program  transformations  .12,  17]. 

In  general,  given  an  assignment  statement,  like  that  in  the  Jacobi  example,  the  compiler 
can  decide  which  processor  “owns”  each  of  the  values  on  the  lefi  and  right  hand  sides  of  the 
equation,  and  can  then  generate  efficient  message  passing  code.  This  is  true  of  all  of  the 
examples  in  this  paper,  and  of  most  others  we  have  studied.  In  other  cases,  the  compiler 
must  generate  runtime  code  which  will  gather  such  information  on  the  fly [  1 7 ] . 

However,  this  issue  of  how  communication  is  expressed  is  subtle;  it  is  not  clear  which 
approach  is  best.  The  choice  m  KFl,  of  leaving  communication  implicit,  appears  natural 
from  the  users  point  of  view,  since  it  dramatically  simplifies  programming  It  also  greatly 
simplifies  “tuning’’  of  parallel  programs,  and  allows  a  “modular”  or  “top  down”  design 
strategy  which  is  impossible  with  Occam-style  languages.  However,  there  is  a  serious  defect 
here;  benign  looking  code  will  some* irncs  run  exceptionally  slowly.  This  is  because  the 
compile*-  cannot  always  generate  effective  message  passing  executable  code,  particularly  for 
complicated  loops. 

We  plan  to  address  tins  issue  by  providing  performance  estimation  tools,  which  will 
indicate  which  parts  of  a  program  will  compile  into  efficient  executable  code,  and  which  will 
n  o.  (liven  such  a  tool,  a  program. ner  should  have  little  trouble  designing  efficient  programs. 
In  the  end.  however,  the  effect iven^ss  of  this  kind  of  language  will  depend  on  many  factors, 
including  the  communications  capabilities  of  architectures,  the  quality  of  compilers,  and  the 
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needs  of  working  programmers.  Resolving  these  issues  is  an  area  of  active  research. 


3  Parallel  Tridiagonal  Solvers 

In  this  section,  we  describe  the  programming  of  a  parallel  algorithm  for  tridiagonal 
systems  of  equations,  using  the  KFl  primitives  of  the  last  section.  Solving  tridiagonal 
systems  is  a  common  "kernel  algorithm”  for  multi-dimensional  tensor  product  algorithms. 
Other  "one-dimensional  kernels”  frequently  needed  are  cubic  spline  fitting  routines,  Fast 
Fourier  fransforms,  and  so  forth,  but  tridiagonal  solvers  are  the  most  commonly  used.  From 
a  programming  point  of  view,  all  oi  these  kernel  algorithms  are  similar,  and  most  can  be 
treated  by  analogous  "divide  and  conquer”  techniques  on  parallel  architectures. 

Consider  the  problem  of  solving  a  tridiagonal  matrix  of  equations  of  size  n.  Let  .-1  be  the 
tridiagonal  matrix  whose  fth  row  has  nonzero  elements  (f>„  a,,  ct),  as  shown  in  Figure  i. 
We  seek  the  solution  A’  of  the  tridiagonal  system, 

A  X  =  f 

assuming  that  the  matrix  .1  can  be  factored  without  pivoting. 

There  are  a  wide  variety  of  parallel  tridiagonal  algorithms  in  the  literature[8].  The 
particular  one  described  here  is  a  “substructured”  algorithm,  which  is  a  variant  of  Sameh’s 
"spike”  algorithm^].  This  algorithm  is  a  tree-structured  “divide  and  conquer”  algorithm 
executing  on  p  processors.  In  the  first  phase  of  the  algorithm,  we  perform  a  sequence  of 
l°rji(p)  reduction  steps,  each  halving  the  size  of  the  tridiagonal  system  being  solved.  In  the 
second  phase  of  the  algorithm,  we  perform  substitution  to  obtain  the  solution. 

The  matrix  A  is  assumed  to  be  distributed  by  blocks  of  rows  across  the  p  processors. 
Given  this  distribution  of  the  array,  processor  i  is  responsible  for  row’s  /,  =  (z  —  1  )n/p  -f  1 
through  u,  ~  iv.jp.  In  the  first  step,  processor  i  performs  elimination  on  rows  lx  T  2  through 
u,,  eliminating  the  lower  diagonal  of  the  tridiagonal  system,  but  introducing  fill-in  in  column 
/,.  Next,  it  performs  elimination  in  the  reverse  direction  on  rows  Uj  —  2  through  eliminating 
the  upper  diagonal,  while  introducing  fill-in  in  column  ux.  Thus  at  the  end  of  this  step,  the 
upper  and  lower  diagonals  are  eliminated  in  rows  /,  +  1  through  ut  —  1.  Moreover,  rows 
/,  and  u,  are  now  coupled  directly  to  each  other,  and  contain  no  entries  corresponding  to 
the  intermediate  rows  Thus  rows  R ,  iq  ,  l2,  u2, lp,  up  now  constitute  a  tridiagonal  system 
having  2 p  equa  ...  as  is  shown  by  the  highlighting  in  Figure  1. 

In  the  scco  :  .  ep,  the  pair  of  equations  corresponding  to  rows  /,  and  ut  on  each  processor 
are  “mailed”  to  mm  processor.  Half  of  the  processors  “receive”  two  pairs  of  equations, 
constituting  four  a-yacent  rows  of  the  matrix,  and  remain  “active.”  The  other  half  of  the 
processors  recG.-e  no  equations,  and  go  to  sleep.  Thus  one  active  processor  will  receive  rows 
l\,Ui,,2>u2,  another  rows  l3,  u3,  U,  ti4,  and  so  on. 
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Figure  1:  Reduction  in  the  first  step  of  the  elimination  process. 
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i-  Igure  2:  Reduction  of  four  rows  of  a  tridiagonal  system. 


Figure  3:  Data  flow  graph. 

These  four  equations  on  each  active  processor  are  then  reduced  to  two,  as  shown  in 
Figure  2,  just  as  in  the  first  step,  so  that  the  first  and  last  equations  on  each  active  processor 
are  directly  coupled.  The  result  is  a  tridiagonal  system  of  size  p.  This  process  continues, 
halving  the  size  of  the  tridiagonal  system  at  each  step  in  the  reduction.  After  log2(p )  such 
steps,  we  obtain  a  single  tridiagonal  system  having  four  rows,  which  we  solve  by  the  sequential 
Thomas  algorithm. 

During  the  log2{p)  steps  of  this  reduction  phase,  each  of  the  reduced  linear  systems 
occurring  must  be  saved.  Then  after  the  final  tridiagonal  system  with  four  equations  has 
been  solved,  we  perform  substitution  into  these  saved  reduced  systems,  in  the  inverse  order 
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Figure  4:  Computation  of  intermediate  values  x1  and  x2 

in  which  they  were  created,  and  finally  recover  the  solution  of  the  original  system.  The 
overall  data  flow  graph  of  this  is  substructured  algorithm  is  shown  in  Figure  3. 

The  substitution  process  itself  is  quite  trivial.  At  each  step,  each  of  the  active  processors 
must  compute  its  share  of  the  solution  of  a  reduced  system.  Each  processor  receives  the  first 
and  last  values  of  the  solution  of  this  part  of  the  reduced  system,  as  shown  in  Figure  4,  and 
computes  the  intermediate  values,  as  shown.  In  the  first  /o^2(p)  —  1  steps  of  the  substitution 
phase,  two  intermediate  solution  values  need  to  be  computed,  and  then  these  4  solution 
values  are  mailed  to  the  processors  needing  them  in  the  next  step.  In  the  last  step,  each 
processor  computes  njp  —  2  solution  values,  completing  the  solution. 


Mapping  of  data  flow  graph  unto  processor  array 

The  data  flow  graph  of  this  substructured  algorithm  is  shown  in  Figure  3.  During  the 
reduction  phase,  the  number  of  active  processors  is  reduced  by  two  at  each  step,  until  finally 
we  have  just  one  active  processor.  During  the  substitution  phase,  the  number  of  active 
processors  doubles  at  each  stage. 

There  are  various  ways  of  mapping  this  data  flow  graph  onto  a  multiprocessor  architec¬ 
ture.  One  of  the  simplest  is  the  shuffle/unshuffle  mapping  shown  in  Figure  5.  This  mapping 
is  easy  to  program,  and  is  advantageous  when  there  are  multiple  tridiagonal  system  to  be 
solved,  as  we  will  show  later. 


KF1  representation  of  algorithm 

It  is  relatively  easy  to  describe  this  kind  of  divide  and  conquer  algorithm  in  KFl. 
Listing  4  gives  the  KFl  code  for  this  algorithm.  Subroutine  hi  takes  as  input  vectors  b.  a, 
and  c  representing  the  matrix  A  and  the  right  hand  side  /,  and  returns  the  solution  vector 
X .  These  vectors  are  declared  to  be  distributed  by  blocks  across  a  one-dimensional  processor 
array  procs  of  size  p.  Note  that  the  routine  uses  the  new  declaration  dynamic  for  temporary 
arrays,  e.g.  tmpa,  whose  sizes  can  be  determined  only  when  the  subroutine  is  invoked. 
Fortran  programmers  generally  perform  dynamic  allocation  “by  hand,”  by  indexing  into  the 
blank  common  area.  Such  techniques  are  awkward  and  difficult  to  handle  on  distributed 
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Figure  5:  Mapping  of  data  flow  graph. 

memory  machines,  hence  “dynamic  arrays”  have  been  included  in  KFl. 

The  outer  do  loop  here  executes  the  log2{p)  steps  of  the  algorithm.  In  each  step,  first  the 
internal  equations  are  eliminated  using  a  doall  loop  and  then  the  outer  two  equations  are 
sent  to  another  processor.  In  the  first  step,  the  number  of  internal  equations  to  be  eliminated 
depends  on  the  original  distribution  of  the  rows  of  the  matrix,  while  in  the  later  steps  each 
processor  reduces  a  system  of  four  equations.  The  first  doall  loop  is  executed  only  in  the 
first  step.  The  on  clause  “on  procs(ip)”  specifies  that  the  ipth  invocation  of  the  loop  is  to 
be  executed  on  processor  procs(ip).  Each  processor  uses  the  predefined  functions  lower  and 
upper,  to  determine  the  index  limits  for  the  block  of  equations  that  it  “owns”.  A  sequential 


13 


parsub  tri(  X,  f,  b,  a,  c,  n;  procs) 

processors  procs(p) 
real  X(n),  f(n)  dist(block) 
real  a(n),  b(n),  c(n)  dist(block) 
integer  lo,  hi,  step 

dynamic  real  tmpa(4*p),  tmpb(4*p)  dist(block) 
dynamic  real  tmpc(4*p),  tinpf(4*p)  dist(block) 

k  =  log2(p) 
do  1000  step  =  1,  k 
if  (stej)  .cq.  1)  then 

doall  100  ip  —  1 ,  p  on  procs(ip) 
lo  — -  lower  (X,  procs(ip)  ) 
hi  =  upper  (X,  procs(ip)  ) 

call  reduce(b(lo:hi),  a(lo:hi),  c(lo:hi),  f(lo:hi),  hi-lo+1) 
tmpb(4*ip-3)  =  b(lo) 
tmpb(4*ip)  =  b(  hi) 
c 

c  ...  code  to  set  up  tmpa,  tmpc,  tmpf  similarly 

c 

100  continue 

else 

doall  200  ip  =  1,  p  on  procs(ip) 
if  (  log2(ip)  +  step  .eq.  k+l  )  then 
lo  =  4*ip  -  3 
hi  —  4*ip 

call  reduce(tmpb(lo:hi),  tmpa(lo:hi),  tmpc(lo:hi),  tmpf(lo:hi),  4) 
cndi  f 

200  continue 

endif 

cal!  unshff(tmpb,  tmpa,  tmpc,  tmpf,  4*p,  step;  procs) 

1000  continue 
c 

c  ...  code  for  substitution  phase 

c 

return 

end 

Listing  4:  Code  for  a  tridiagonal  solve 
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routine  reduce  is  then  called  to  eliminate  the  internal  equations  of  the  block  of  equations 
owned  by  the  processor.  Routine  reduce  is  a  simple  sequential  linear  algebra  routine,  not 
shown.  After  elimination  of  the  internal  equations  in  the  first  step,  the  outer  two  equations 
are  transferred  to  temporary  arrays  to  be  communicated  to  the  next  step  of  the  elimination 

process. 

In  each  step,  the  routine  unshff  (shown  in  Listing  5)  is  called  to  permute  the  equations 
among  the  processors  as  needed.  Given  the  simple  distribution  pattern  here,  the  compiler 
can  convert  the  assignment  statements  representing  the  permutation  in  unshff,  into  sends 
and  receives  required  for  communicating  the  data1. 

After  the  equations  have  been  moved,  the  four  equations  in  each  of  the  active  processors 
are  again  reduced  to  two  by  calling  the  routine  reduce.  This  is  done  in  the  second  doall 
loop,  where  the  "if”  condition  controls  which  processors  are  active. 


Pipelined  parallel  tridiagonai  algorithm 

In  the  above  tridiagonai  algorithm,  the  number  of  active  processors  is  halved  at  each 
phase.  If  we  have  to  solve  more  than  one  tridiagonai  system  then  these  computations  can  be 
pipelined  so  that  more  of  the  processors  are  kept  busy.  Listing  6  shows  a  pipelined  version 
of  the  tridiagonai  solver.  The  subroutine  mtrix  accepts  m  tridiagonai  systems  each  of  size 
n  represented  by  m  x  n  arrays.  Note  that  the  second  dimension  of  the  data  arrays  here  is 
distributed  across  the  one  dimensional  processor  array,  i.e. ,  each  processor  contains  a  block 
of  each  of  the  m  tridiagonai  systems. 

Here,  in  each  of  the  first  m  steps,  a  new  set  of  equations  are  reduced  to  yield  a  set  of 
2 p  equations.  The  second  doall  loop  is  setup  such  that  different  subsets  of  the  processors 
"handle”  different  sets  of  equations  during  a  step.  The  routine  reduce  is  the  same  as  discussed 
before  while  routine  munshfis  similar  to  unshff  except  that  it  needs  to  know  which  particular 
set  of  equations  are  to  be  “unshufffed”. 


4  Two  Dimensional  Tensor  Product  Computations 

In  the  last  section  we  presented  simple  one  dimensional  kernel  algorithms  in  KF1.  This 
section  shows  how  such  one  dimensional  kernels  can  be  combined  for  two  dimensional  tensor 
product  calculations.  The  example  chosen  here  is  an  ADI  iteration.  Mapping  ADI  methods 
to  distributed  memory  architectures  has  been  previously  studied  by  several  groups[9,  14]. 

ADI  (Alternating  Direction  Implicit)  is  a  well  known  and  effective  method  for  solving 

1  The  transformation  here  would  be  trivial,  if  the  user  gave  a  pragma  specifying  “inline”  expansion  of 
procedure  unshff.  Without  such  a  pragma,  the  inter-procedural  analysis  required  is  difficult,  and  is  the  focus 
of  ongoing  research. 
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parsub  unshff(  b,  a,  c,  f,  n,  phase;  procs) 

processors  procs(p) 

real  b(n),  a(u),  c(n),  f(n)  dist  (block) 

integer  phase,  dest,  src 

k  =  log2(p) 

doall  100  ip  =  1,  p  on  procs(ip) 

if  (  log2(ip)-j-phase  .eq.  k-t-1  )  then 
dest  =  lower  (b,  procs) 
src  =  (2*lo)  %  n 

b(dest)  =  b(src-l) 
b(dcst  +  l)  =  b(src+2) 
b(dest+2)  =  b(src+3) 
b(dest+3)  =  b(src+6) 
c 

c  ...  similarly  for  a,  c,  and  f 

c 

endif 

100  continue 

ret  urn 
end 

Listing  5:  Routine  to  “unshuffle”  equations. 

partial  differential  equations  in  two  or  more  dimensions[15,  19].  It  is  widely  used  in  compu¬ 
tational  fluid  dynamics,  and  other  areas  of  computational  physics.  The  name  ADI  derives 
from  the  fact  that  “implicit”  equations  are  solved  in  both  the  x  and  y  directions  at  each 
step.  These  implicit  equations  are  often  tridiagonal  systems,  and  can  be  solved  by  parallel 
tridiagonal  solvers,  such  as  those  described  in  the  last  section.  Thus  the  main  task  here  is 
to  show  how  calls  to  these  parallel  tridiagonal  routines  are  combined,  to  specify  the  ADI 
algorithm.  Almost  any  working  numerical  analyst  would  know  how  one  does  that  in  Fortran. 
With  a  language  like  KF1,  the  same  approach  yields  a  distributed  parallel  implementation. 

Mathematically,  ADI  works  as  follows.  Suppose  one  is  trying  to  solve  a  partial  differential 

equation: 

y)^ xx  T  V)^ yy  T  c(x,  y)U  f 

Viewing  this  as  an  operator  equation 

LU  =  f,  (1) 
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parsub  mtrix(  X,  f,  b.  a,  c,  m,  n;  procs) 


processors  procs(p) 

real  X(m,  n),  f(m,  n)  dist(*,  block) 

real  a(m,  n),  b(m,  n),  c(m,  n)  dist(*,  block) 

integer  lo,  hi,  step 


dynamic  real  tmpa(m,  4*p),  tmpb(m,  4*p)  dist(*,  block) 
dynamic  real  tmpc(m,  4*p),  tmpf(m,  4*p)  dist(*,  block) 

k  =  log2(p) 

do  1000  step  =  1,  m-rk 
if  (step  .le.  m)  then 

doall  ip  =  1,  p  on  procs(ip) 

lo  =  lower  (X(step,  *),  procs(ip)  ) 
hi  =  upper  (X(step,  *),  procs(ip)  ) 
call  reduce(b(step,  lo:hi),  a(step,  lo:hi), 
fc  c(step,  lo:hi),  f(step,  lo:hi),  hi-lo+1) 

tmpb(step,  4*ip-3)  =  b(step,  lo) 

tmpb(step,  4*ip)  =  b(step,  hi) 

c 

c  ...  code  to  set  up  tmpa,  tmpc,  tmpf 

c 

100  continue 

else 

doall  ip  =  1,  p  on  procs(ip) 
j  =  step  +  log2(ip)  -  k 
if  (j  .gt.  1  .and.  j  .le.  m)  then 
lo  =  4*ip  -  3 
hi  =  4*ip 

call  reduce(tmpb(j,  lo:hi),  tmpa(j,  lo:hi), 

&  tmpc(j,  lo:hi),  tmpf(j,  lo:hi),  4) 

endif 

200  continue 

endif 

call  munshf(tmpb,  tmpa,  tmpc,  tmpf,  4*p,  step;  procs) 
continue 

...  code  for  substitution  phase 

return 
end 

Listing  6:  Pipelined  version  to  solve  m  tridiagonal  systems 
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one  decomposes  L  into  a  sum  of  two  parts: 


InU  +  L2U  =  /, 


where 

Q 

L\  =  a(x,y)-^— +  c(x,y)j2 
L2  =  b(x,y)~  +  c(x,y)/ 2 

Then  instead  of  directly  solving  equation  1,  one  successively  solves 

(Li  +  I)v  =  f  (2) 

and 

( L2  +  I)u  =  v.  (3) 

Carrying  out  these  two  operations  gives  a  first  approximation  to  the  solution  of  equation  1. 
Afier  this  one  replaces  the  right  hand  side  /  by  the  residual 

r  =  Lu-  /,  (4) 

and  repeats  the  process.  Continuing,  one  has  an  efficient  iterative  method,  which  converges 
to  the  solution  of  equation  1. 

The  advantage  of  this  algorithm  over  competing  iterative  methods  is  that  it  converges 
quite  rapidly,  and  the  solutions  to  the  equations  2  and  3  only  require  inexpensive  tridiagonal 
solves.  Listing  7  presents  this  algorithm  in  KF1.  This  version  of  ADI  uses  the  non-pipelined 
parallel  tridiagonal  solver.  Here  resid.  is  a  simple  subroutine  which  forms  the  residual  of 
equation  4.  This  residual  computation  is  similar  to  one  step  of  a  Jacobi  iteration,  and 
induces  the  same  communication. 

The  on  clauses  here  force  each  loop  invocation  to  be  performed  on  the  appropriate  slice 
of  the  two-dimensional  processor  array.  The  construct  “owner ((r(i,  *))”  specifies  the  set  of 
processors  which  own  the  fth  row  of  the  array  r.  The  ith  tridiagonal  system  in  the  y-direction 
is  solved  by  calling  the  subroutine,  trie ,  as  follows: 

call  tric(v(i,  *),  r(i,  *),  bO,  bi,  bO,  ny;  owner(r(i,  *))) 


The  routine  is  passed  a  slice  of  the  data  arrays  v  and  r,  along  with  the  slice  of  the  proces¬ 
sor  array  on  which  these  \alues  reside  so  that  it  can  execute  in  parallel.  The  tridiagonal 
solver,  trie,  here  is  just  the  constant  coefficient  version  of  routine  tri  presented  in  section  3. 
Programming  ADI  with  variable  coefficients  is  not  much  different,  except  that  there  are  a 
number  of  additional  details  not  germane  to  this  paper. 
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pat-sub  adi  (u,  f,  nxp,  nyp;  procs) 


processors  procs(px,  py) 

real  u(0:nxp,  0:nyp),  f(0:nxp,  0:nyp)  dist  (block,  block) 
dynamic  real  r(0:nxp,  O.nyp),  v(0:nxp,  0:nyp)  dist  (block,  block) 

common  /params/  maxits,  a,  b,  c 

c 

c  ADI  iteration  for  the  constant  coefficient  problem 

c  a  *  U  xx  -r  b'Ujj  -r  c*U  »  F 

c 

nx  —  nxp— 1 
ay  —  nyp-1 

aO  =  a/(nx*nx) 
al  =  c  -  2*a0 
bO  =  b/(ny*ny) 
bl  *  c  -  2*b0 

do  1000  it  =  1,  maxits 

call  rcsid(r,  u,  f,  nx,  ny;  procs) 
c 

c  perform  tridiagonal  solves  in  y  direction 

c 

doall  100  i  =  1,  nx  on  owner(r(i,  *)) 

call  tric( v(i,  *),  r(i,  *),  bO,  bl,  bO,  ny;  owner(r(i,  *))  ) 

100  continue 

c 

c  perform  tridiagonal  solves  in  x  direction 

c 

doall  200  j  =  1,  ny  on  owner(v(*,  j)) 

call  tric(u(*,  j),  v(*,  j),  aO,  al,  aO,  nx;  owncr(v(*,  j)  ) 


200 

continue 

1000 

continue 

return 

e  nd 

Listing  7:  ADI  Algorithm 
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parsub  madi  (u,  f,  nxp,  nyp;  procs) 
processors  procs(px,  py) 

real  u(0:nxp,  0:nyp),  f(0:nxp,  0:nyp)  dist  (block,  block) 
dynamic  real  r(0:nxp,  0:nyp),  v(0:nxp,  0:nyp)  dist  (block,  block) 
integer  lo,  hi 

common  /params/  maxits,  a,  b,  c 

nx  =  a  x  p  —  1 
ny  =  nyp- 1 

a/(  r.x“  r.x) 
c. 2  2'a«> 

b  1  r.  y  ' : .  y 
c  2  J 

do  1  in)'.)  i:  1 . 

call  res;::  r.  :.v;  procs) 

c  pt  rform  l  nd:<:g<  solves  tr:  ?/  direction 

doall  100  ip  ,  px  on  procs(:p,  *) 

!o  lower(v,  procs(ip,  *),  1) 
hi  =  upper( v,  procs(ip,  *),  1) 

call  mtrixc(v(lo:hi,  *),  r(lo:hi,  *),  bO,  bl,  bO,  hi-lo+1,  ny;  procs(ip,  *)) 
100  continue 

c  perform  tridiagonal  solves  in  x  direction 

doa'.l  200  jp  =  1,  py  on  procs(*,  jp) 
lo  =  lo\ver(v,  procs(*,  jp),  2) 
hi  =  upper(v,  procs(+,  jp),  2) 

call  mtriyc(u(*,  lo:hi),  v(*,  lo: hi) ,  aO,  al,  aO,  nx,  hi  —  lo-*- 1;  procs(*,  jp)) 
200  continue 

1000  continue 

return 

end 


aO 
al  - 
bO 


Listing  8:  Pipelined  ADI  Algorithm 
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\  1  ^ 


•  of  i he  non-pipefined  tridin 
snares  m  tire  solution  o:  nx 
,  my  tridiagonal  systems  dt: 
in  t:.e  mpemaxi  version  ot  t 
I  >.  Abilin,  t he  parallel  tridt; 
i  ot  subroutine  minx  o'.  sec 
'  a r raws  bussed  as  ;i raume 


.gonal  solver  liere  is  somewhat  inefficient,  since  cacti  pro- 
mu  tridiagonal  systems  during  the  y-direction  solution, 
iring  the  x-direction  solutions.  One  can  get  better  speed- 
lie  tridiagonal  solver.  The  improved  algorithm  is  given  in 
agonal  solver,  mtrixe,  used  here  is  the  constant  coefficient 
tion  3.  Separate  routines  mtrixe  and  mtriyc  are  needed, 
•r.ts  will  be  transposed  during  the  y-direction  part  of  the 


On. 


ini-,  of  d::f<  rcr.co  between  this  second  pipelined  version  of  ADI,  and  the 


ifii vcrme  : 

:;.cox’ 

*<i  w ! 

dci:  one  leaves  "strip-mining  ’ 

of  parallel  Soups  to 

principal. 

.  a 

(  i  »r 

a!  compiler  cmtld  generate  t h i: 

>  second  version  ot 

■>:.  -mum. 

■  !)V  •  cj 

•  rest  met 

uring.  This  is  rather  difficult 

fi.owever,  since  *he 

first  have  : ■  >  menu'  ipm  calls  to  sub:  ■  utme  tn,  and  then  reschedule  the 
,e  tridiagonal  <•  1v«t  it:  implicated  wavs,  i  his  is  well  beyond  the  capabilities 
mil  •:>:  f  r  the  ;  r»  .  pr-  'gra turners  wishing  impruveci  pt  rMrntance  will  have 


5  Higher  Dimensional  Tensor  Product  Algorithms 


i  tie 


o  ctm.ensiouul  tensor  prod  net  computations.  One  <:<>ui<l 

n , rr u  n  . ’-!  u_  . i. 


1  v  nr  gram  such  also  ’tit:. ins  in.  Occam-like  languages,  though  it  would  he  awkward, 
•r.  •  hat  ceases  o>  be  t  he  case  when  one  looks  at  more  complex  tensor  product  algo- 
in  this  section  we  look  a*  a  three  dimensional  mu  It  igrid  algorithm  having  complexity 
..king  that  ul  real  applications.  In  our  view,  having  a  “tool  like  KFl  would  be  a  v.r- 
cessitv  f  r  routine  programming  of  algorithms  like  this.  The  idea  of  designing  an  i 
inv  such  algorithms  in  at;  Occam-like  language  is  rath.er  daunting. 


net  algorithms  are  widely  used  for  three  dimensional  numerical  computations, 
amines  the  expression  of  a  three  dimensional  multigrid  algorithm  based  on 
ion.  This  algorithm  is  moderately  complex,  and  is  interesting  since  the  plane 
in  the  zebra  relaxation  are  themselves  tensor  product  multigrid  algorithms 


The  mwtigr.d  algorithm  here  is  one  of  a  number  of  related  multigrid  algorithms  based 
on  plane  relaxation.  This  particular  one  uses  “semi-coarsening”  and  plane  relaxations  in 
onlv  one  direction  3,  4  .  We  mention  these  numerical  details  only  to  make  it  clear  that  this 
algorithm  is  not  artificially  complex.  Quite  the  opposite  in  fact;  algorithms  of  much  greater 
complexity  are  routinely  used  for  modeling  of  physical  problems. 

I  owing  0  presents  the  subroutine  nig 3  which  performs  one  multigrid  iteration  on  a  three 
dimensional  rectangular  grid.  We  assume  the  partial  differential  equation  being  solved  is 
,t  simple  Poisson-like  equation,  with  constant  coefficients,  and  also  assume  homogeneous 
Dirici.let  ’boundary  conditions. 
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pursub  mg3(u,  f,  nx,  ny,  nz;  proc  ) 


processors  procs(px,  py; 

real  u.0:nx,  0:ny,  0:nz),  f(0:nx,  0:ny,  Q:nz)  dist  (*,  block,  block) 
dynamic  real  r(0:nx,  0:ny,  0:nz)  dist  (*,  block,  block) 

dynamic  real  v(0:nx,  0:ny,  Cknz/2),  v(0:nx,  0:ny,  0:nz/2)  dist  (*,  block,  block) 
pc '•form  ebra  relaxation  on  even  planes 

C 

call  resid3(r,  u,  f;  procs) 

doall  100  k  =  2,  nz--2,  2  on  o\vner(u(*.  *,  k)) 
call  mg2(u(*,  *,  k),  r(*,  *,  k);  owncr(u(*,  *,  k))) 

O1.'1  continue 

c 

c  perform  zebra  relaxation  on  odd  planes 

c 

call  resid3(r,  u,  f;  procs) 

doall  200  k  =  1 ,  nz-1,  2  on  o\vner(u(*,  *,  k)) 
call  mg2(u(*,  *,  k),  r(*.  *,  k);  o\vner(u(*,  *,  k))) 

200  continue 

c 

c  recursively  solve  coarse  grid  problem 

c 

if  (nz  .gt.  2)  then 

call  resid3(r,  u,  f;  procs) 
call  rest3(g,  r;  procs) 

call  mg3(v,  g;  procs) 
call  intrp3(u,  v;  procs) 
endif 

ret  urn 
end 

Listing  9:  Three  Dimensional  Multigrid  Solver 

There  are  two  basic  operations  here,  zebra  relaxation  and  solution  of  coarse  grid  problems. 
The  zebra  relaxations  are  given  by  the  two  doall  loops.  The  first  performs  half  of  a  zebra 
sweep  by  visiting  the  odd  planes,  while  the  second  visits  the  even  planes,  to  complete  the 
zebra  relaxation.  The  calls  to  rcsid3  before  each  doall  loop  compile  the  “residual,”  that  is, 
the  amount  by  which  we  currently  fail  to  satisfy  the  differential  equation.  The  relaxation 
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pnrsub  v,  :ix,  r.v.  nxi,  nzc;  procs) 

processors  piws^px,  ny; 

real  u'0:xx,  0:xy,  U::izf),  v(0:nx,  Oaiy,  0:nzc)  dist  (*,  block,  block) 
c 

c  ek  whether  coarse  and  fine  grid  dimensions  are  properly  related 

c 

if  :..:f  .r.e.  2'r.zc)  call  error'' "Dimensions  do  not  match  in  intrp3") 

do  300  i  --  1 ,  r.x  1 

c  'codify  values  on  even  planes  ... 

c 

doall  100  (j.  !D  ---  1,  ny-  1;  “  [2,  nzf-2,  2]  on  o\vncr(u(*,  j,  k)) 

•Zi,  Jr  k)  •  u-J,  j.  kj  ~  v(i,  j.  k/2) 

■)  continue 

c  modify  values  on  odd  planes  ... 

c 

doall  2 00  (j,  k)  &  [1,  ny-  1;  *  [1,  nzf-1,  2]  on  owner(u(*,  j,  k)) 

kp  --  (k-r- 1 )/ 2 

km  -  (k  -  l)/2 

i:(i,  j,  k)  --  u(i,  j,  k)  -r  0.5*(v(i,  j,  kp)  +  v(i,  j,  km)) 

20e)  continue 

3'i'J  continue 

ret  urn 
end 

Listing  10:  Three  Dimensional  Interpolation  Routine 

d sc::  :s  performed  by  the  calls  to  rng2,  each  of  which  “solves”  a  two  dimensional  partial 
differentia!  equation  for  the  solution  values  on  one  plane. 

The  heart  of  the  multigrid  algorithm  is  the  “coarse  grid  correction”  in  which  we  recur¬ 
sively  call  subroutine  mg3  with  a  smaller  problem  to  accelerate  the  convergence.  In  the 
algorithm  here,  subroutine  mg 3  is  called  with  arrays  v  and  g,  which  are  half  as  large  as  the 
original  arrays  u  and  /,  since  the  number  of  points  in  the  z-direction,  nz,  is  halved. 

Postponing  temporarily  issues  related  to  parallelism,  lets  look  briefly  at  the  various  kernel 
subroutines.  Subroutine  intrp3  is  typical.  One  possible  variant  of  this  routine  is  given  in 
Listing  Id.  This  subroutine  modifies  the  current  fine  grid  solution  u,  using  the  value  of 


the  coarse  grid  correction  v.  Since  v  exists  only  on  a  coarse  grid  having  half  as  many 
points,  values  at  intermediate  points  must  be  computed  by  interpolation.  The  simple  linear 
interpolation  formula 


0.5  *  (■ v(i,j,kp )  +  v(i,j,km)), 

giving  the  intermediate  value  as  the  average  of  the  two  nearest  values,  is  used  here.  Note 
that  more  efficient  execution  would  probably  be  achieved  if  the  sequential  i  loop  was  nested 
inside  the  doall  loops.  This  is  a  trivial  program  transformation  which  a  good  compiler  should 
be  able  to  perform. 

Subroutines  rcsid.3  and  rcstS  are  analogous  to  intrp3 ,  except  the  numerical  formulas 
occurring  are  more  involved.  Subroutine  mg2,  shown  in  Listing  11,  is  the  two  dimensional 
analog  of  mg3. 

The  reader  interested  in  numerical  details  is  referred  to  [3,  4,  10,  22].  Our  primary 
interest  here  is  in  the  expressiveness  of  the  language  constructs.  Readers  familiar  with  the 
programming  of  such  algorithms  in  sequential  Fortran  should  be  able  to  see  that  there  is 
very  little  difference  between  the  programming  of  such  algorithms  in  Fortran  and  in  KF1. 

The  other  interesting  issue  here  is  that  of  parallelism.  The  heart  of  this  issue  is  the  calls 
to  mg2  in  the  zebra  relaxations: 

c 

c  perform  zebra  relaxation  on  even  planes 
c 

call  resid3(r,  u,  f;  procs) 

doall  100  k  =  2,  nz-2,  2  on  o\vner(u(*,  *,  k)) 

call  mg2(u(V,k),  r(V,k);  owner(u(*,  *,  k))) 

100  continue 


Given  our  initial  blocking  of  the  three  dimensional  arrays 
processors  procs(px,  py) 

real  u(0:nx,  0:ny,  0:nz),  f(0:nx,  0:ny,  0:nz)  dist  (*,  block,  block) 


subroutine  mg2  is  passed  a  slice  of  the  processor  arrays  corresponding  to  the  x-y  planes: 


k)  and  r(  +  ,  *,  k) 


Thus  subroutine  nig 2  inherits  a  one  dimensional  processor  array,  while  its  kernel  tridiagonai 
solver,  scqtri,  runs  sequentially. 
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parsub  mg2(u,  f,  nx,  ny;  procs) 
processors  procs(px,  py) 

real  u(0:nx,  0:ny),  f(0:nx,  0:ny)  dist  (*,  block) 

dynamic  real  r(0:nx,  0:ny)  dist  (*,  block) 

dynamic  real  v(0:nx,  0:ny/2),  v(0:nx,  0:ny/2)  dist  (*,  block) 

c 

c  perform  zebra  relaxation  on  ever,  planes 

c 

call  resid2(r,  u,  f;  procs) 
doall  100  j  =  2,  ny  — 2,  2  on  ownor(  u(r,  j)  ) 
call  scqtri(u(*)  j),  r(*,  j)) 
i  00  continue 

c 

c  perform  zebra  relaxation  on  odd  planes 

c 

call  resid2(T,  u,  f;  procs) 
doall  200  j  =  1,  ny-1,  2  on  owner!  u(*,  j)  ) 
call  mg2(u(*,  j),  r(*.  j)) 

200  continue 

c 

c  recursively  solve  coarse  grid  problem 

c 

if  (ny  .gt.  2)  then 

call  resid2(r,  u,  f;  procs) 
call  rest2(g,  r;  procs) 

call  mg2(v,  g;  procs) 
call  intrp3(u,  v;  procs) 
endif 

ret  urn 
end 


Listing  11:  Two  Dimensional  Multigrid  Solver 


We  could  have  done  things  differently  by  changing  the  dimensionality  of  the  original 
processor  array  in  mg3.  Had  we  used  a  three  dimensional  processor  array  there,  the  tridiag¬ 
onal  solves  in  mg2  would  have  been  parallel.  Conversely,  if  we  had  used  a  one  dimensional 
processor  array,  with  the  three  dimensional  arrays  distributed 

real  u(0:nx,0:ny,0:nz),  f(0:nx,0:ny,0:nz)  dist  (*,*,  block) 
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i  :.f  b<-st  alternative  here  depends  on  t he  problem  size,  the  number  of  processors  in  the 
a-e-.’.’ec'  ure.  t he  cost  <«i  communication,  and  so  on.  Such  issues  are  outside  the  scope  of 
paper.  However,  it  is  won::  moling  that  with  KFl-like  languages,  one  can  experiment 
alternate  array  distributions  very  easily.  It  is  only  necessary  to  change  the  blocking 
n  o.  •  t : .e  urrav  declarations.  and  change  some  of  the  doall  to  do  loops,  or  converse!’/. 

!  is  :::e:.t a  1  diuerence  bet  ween,  this  kind  of  language,  and  Occam-style  languages. 

\\  message  passing  languages,  the  structure  of  the  array  distributions  will  be  embedded 
at  the  .urogram,  and  'bus  very  difficult  to  change. 


G  Conclusions 


As  distributed  memory  architectures  change  from  awkward  research  curiousities,  to  pro¬ 
em  c:  ion  machines  for  real  applications,  more  attention  needs  to  be  focused  on  programming 
environments.  Ib.pericnce  has  shown  that  message  passing  languages  suffice,  but  are  ex¬ 
tremely  awkward  for  complex  algorithms  such  as  those  described  here.  The  message  passing 
•/«’»•« :.->n  o*  a  program  is  often  five  to  ten  times  longer  than  the  sequential  version.  In  addition 
the  intricate  “message  plumbing”  makes  programs  difficult  to  debug,  and  “hard  wires”  all 
algorithm  choices,  preventing  easy  experimentation  with  alternate  algorithm  designs. 

One  fundamental  problem  with  such  message  passing  languages  is  that  they  provide  no 
support  f‘>:  “distributed  procedures.”  On  top  of  this,  the  massive  amounts  of  low  level 
message  passing  detail  required  to  express  complex  algorithms  is  a  severe  impediment.  Lan¬ 
guages  like  KFi  and  those  in  related  projects,  [l,  21],  appear  to  provide  a  better  alternative. 
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op'over.  the  executable  code  generated  will  be  efficient,  since  the  programmer  retains 
■ritrol  over  data  distribution  and  scheduling.  In  the  examples  here,  there  would  be  no 
iference  between  the  execution  time  of  algorithms  expressed  in  KFI,  and  those  expressed 
a  message  passing  language,  assuming  equally  good  back-end  machine  code  generators, 
he  price  of  using  KFI  instead  of  a  message-passing  language  is  simply  slower  compilations, 
nee  ‘here  are  additional  compiler  transformations  to  be  performed. 


However,  despite  the  ubiquity  and  importance  of  tensor  product  algorithms,  it  seems 
clear  that  the  usability  and  generality  of  programming  constructs  for  distributed  memory 
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nrvhiffcturvs  v;:'.’.  be  doternuued  largely  by  their  success  on  more  complex  problems,  such  as 
oc  ::.v  C.'ixg  adaptive  or  irregular  grids  and  general  sparse  matrices.  We  are  addressing 
these  issues  in  the  Kali  project  as  well,  in  this  paper  we  have  focused  on  the  simpler  problem 
■  c  expressing  tensor  product  algorithms,  since  this  topic  raises  a  concise  set  of  issues,  and 
;:..s  intrinsic  importance.  We  also  believe  that  many  of  the  ideas  here  apply  as  well  to  more 
-  m._.v  ‘  ■  uic; its .  i  tins  t no  rt Lu\  o:  icusoi  pivuitict  ai^Onthins  can  be  fhr'U'Wd  r,f  as  a 

valuable  ju:::ping-o:f- point  for  design  of  language  constructs  for  harder  problems. 

A  c  mpiler  for  Kbl  is  being  implemented,  and  a  compiler  for  a  related  Pascal-like  lan¬ 
guage  has  been  implemented  by  CL  Koelbel  at  Purdue  University  as  part  of  his  thesis  research 
If  .  We  plan  to  compare  the  run-time  performance  and  expressivity  of  KF1  with  that  of 
related  languages  fur  distributed  memory  machines  in  the  corning  months.  However,  raw 
performance  is  clear  1\  not  the  real  issue;  the  deeper  issue  is  “usability.”  How  do  users  wish 
to  express  algorit  hrr.s,  what  level  of  control  do  they  need/ want,  which  kinds  of  constructs 
xm  urah  In  the  end.  it  is  ut>  to  the  user  community  to  resolve  these  issues. 
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